Most Ligand-Based Benchmarks Measure Overfitting Rather than Accuracy
Authors
Abstract
Undetected overfitting can occur when there are significant redundancies between training and validation data. We describe AVE, a new measure of training-validation redundancy for ligand-based classification problems that accounts for the similarity amongst inactive molecules as well as active ones. We investigated nine widely used benchmarks for virtual screening and QSAR, and show that the amount of AVE bias strongly correlates with the performance of ligand-based predictive methods irrespective of the predicted property, chemical fingerprint, similarity measure, or previously applied unbiasing techniques. Therefore, it is likely that the previously reported performance of most ligand-based methods can be explained by overfitting to benchmarks rather than by good prospective accuracy.
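The redundancy the abstract describes can be illustrated with a simplified nearest-neighbor bias score. The sketch below is a hypothetical illustration, not the paper's exact AVE formulation: it assumes fingerprints are modeled as sets of "on" bit indices, uses Tanimoto similarity, and scores how much closer validation actives/inactives sit to training molecules of their own class than to the opposite class.

```python
# Simplified, illustrative AVE-style bias score (hypothetical
# implementation; the published AVE definition may differ in detail).
# A fingerprint is represented as a set of "on" bit indices.

def tanimoto(a, b):
    """Tanimoto similarity between two bit-index sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def mean_nn_similarity(validation, training):
    """Mean nearest-neighbor Tanimoto similarity of each
    validation fingerprint to the training set."""
    return sum(max(tanimoto(v, t) for t in training)
               for v in validation) / len(validation)

def ave_bias(va, ta, vi, ti):
    """Redundancy score for a train/validation split.

    va/ta: validation/training actives; vi/ti: validation/training
    inactives. Positive values indicate validation molecules are
    more similar to same-class training molecules than to
    opposite-class ones, i.e. a biased (too easy) split.
    """
    return (mean_nn_similarity(va, ta) - mean_nn_similarity(va, ti)
            + mean_nn_similarity(vi, ti) - mean_nn_similarity(vi, ta))

# Toy example: validation molecules are near-duplicates of
# same-class training molecules, so the split is heavily biased.
ta = [{1, 2, 3}, {1, 2, 4}]
va = [{1, 2, 3, 5}]
ti = [{7, 8, 9}, {7, 8, 10}]
vi = [{7, 8, 9, 11}]
print(ave_bias(va, ta, vi, ti))  # → 1.5
```

On an unbiased split both same-class and cross-class nearest-neighbor similarities would be comparable and the score would be near zero; here the near-duplicate validation molecules drive it well above zero.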
Similar Resources
CPAR: Classification based on Predictive Association Rules
Recent studies in data mining have proposed a new classification approach, called associative classification, which, according to several reports, such as [7, 6], achieves higher classification accuracy than traditional classification approaches such as C4.5. However, the approach also suffers from two major deficiencies: (1) it generates a very large number of association rules, which leads to...
A Compression-Based Distance Measure for Texture
The analysis of texture is an important subroutine in application areas as diverse as biology, medicine, robotics and forensic science. While the last three decades have seen extensive research in algorithms to measure texture similarity, almost all existing methods require the careful setting of many parameters. There are many problems associated with a surfeit of parameters, the most obvious ...
Bureau of the Census Statistical Research Division Report Series, SRD Research Report Number CENSUS/SRD/RR-93/04: The Overfitting Principles Supporting AIC
In the context of statistical model estimation and selection, what is "overfit"? What is "overparameterization"? When is a "principle of parsimony" appropriate? Suggestive answers are usually given to such questions, rather than precise definitions and mathematical statistical results. In this article, we investigate some relations that yield asymptotic equality between a variate which is th...
Modelling Influence and Opinion Evolution in Online Collective Behaviour
Opinion evolution and judgment revision are mediated through social influence. Based on a large crowdsourced in vitro experiment (n = 861), it is shown how a consensus model can be used to predict opinion evolution in online collective behaviour. It is the first time the predictive power of a quantitative model of opinion dynamics is tested against a real dataset. Unlike previous research on th...
Overfitting Reduction of Text Classification Based on AdaBELM
Overfitting is an important problem in machine learning. Several algorithms, such as the extreme learning machine (ELM), suffer from this issue when facing high-dimensional sparse data, e.g., in text classification. One common issue is that the extent of overfitting is not well quantified. In this paper, we propose a quantitative measure of overfitting referred to as the rate of overfitting (RO...
Journal: CoRR
Volume: abs/1706.06619
Publication date: 2017